context realization
- Europe > Switzerland > Zürich > Zürich (0.04)
- North America > Canada (0.04)
- Europe > Switzerland > Zürich > Zürich (0.04)
- North America > Canada (0.04)
Contextual Blocking Bandits
Basu, Soumya, Papadigenopoulos, Orestis, Caramanis, Constantine, Shakkottai, Sanjay
We study a novel variant of the multi-armed bandit problem, where at each time step, the player observes a context that determines the arms' mean rewards. However, playing an arm blocks it (across all contexts) for a fixed number of future time steps. This model extends the blocking bandits model (Basu et al., NeurIPS19) to a contextual setting, and captures important scenarios such as recommendation systems or ad placement with diverse users, and processing diverse pool of jobs. This contextual setting, however, invalidates greedy solution techniques that are effective for its non-contextual counterpart. Assuming knowledge of the mean reward for each arm-context pair, we design a randomized LP-based algorithm which is $\alpha$-optimal in (large enough) $T$ time steps, where $\alpha = \tfrac{d_{\max}}{2d_{\max}-1}\left(1- \epsilon\right)$ for any $\epsilon >0$, and $d_{max}$ is the maximum delay of the arms. In the bandit setting, we show that a UCB based variant of the above online policy guarantees $\mathcal{O}\left(\log T\right)$ regret w.r.t. the $\alpha$-optimal strategy in $T$ time steps, which matches the $\Omega(\log(T))$ regret lower bound in this setting. Due to the time correlation caused by the blocking of arms, existing techniques for upper bounding regret fail. As a first, in the presence of such temporal correlations, we combine ideas from coupling of non-stationary Markov chains and opportunistic sub-sampling with suboptimality charging techniques from combinatorial bandits to prove our regret upper bounds.
- North America > United States > Texas > Travis County > Austin (0.14)
- North America > United States > Hawaii (0.04)
- Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
- Europe > France > Île-de-France > Paris > Paris (0.04)
- Information Technology > Data Science > Data Mining > Big Data (1.00)
- Information Technology > Artificial Intelligence > Representation & Reasoning > Optimization (0.67)
- Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Undirected Networks > Markov Models (0.48)
Stochastic Bandits with Context Distributions
Kirschner, Johannes, Krause, Andreas
We introduce a novel stochastic contextual bandit model, where at each step the adversary chooses a distribution over a context set. The learner observes only the context distribution while the exact context realization remains hidden. This allows for a broader range of applications, for instance when the context itself is based on predictions. By leveraging the UCB algorithm to this setting, we propose an algorithm that achieves a $\tilde{\mathcal{O}}(d\sqrt{T})$ high-probability regret bound for linearly parametrized reward functions. Our results strictly generalize previous work in the sense that both our model and the algorithm reduce to the standard setting when the environment chooses only Dirac delta distributions and therefore provides the exact context to the learner. We further obtain similar results for a variant where the learner observes the realized context after choosing the action, and we extend the results to the kernelized setting. Finally, we demonstrate the proposed method on synthetic and real-world datasets.